
Level 0 merge #4697


Open · wants to merge 6 commits into master

Conversation

SirTyson
Contributor

Description

Resolves #4696

This change special-cases level 0 Bucket merges so they happen entirely in memory, without file IO. To accomplish this, all level 0 buckets retain an in-memory vector of their contents, and merges use these vectors instead of file iterators. I've also reduced the number of times we index buckets during addBatch, and indexing now uses the in-memory vectors as well.

This reduces addBatch time by about 50% under high load, as seen in max TPS tests.
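
To illustrate the idea (a simplified sketch, not the actual stellar-core code: the Entry struct and function name below are stand-ins), the level 0 merge reduces to a plain merge of two key-sorted vectors in which newer entries shadow older ones, with no file iterators involved:

#include <cstddef>
#include <vector>

// Stand-in for the real BucketEntry, which is an XDR union.
struct Entry
{
    int key;   // stand-in for a LedgerKey
    int value; // stand-in for the entry body
};

// Merge two key-sorted in-memory vectors; on a key collision the newer entry
// shadows the older one. No disk IO anywhere in this path.
std::vector<Entry>
mergeInMemory(std::vector<Entry> const& newer, std::vector<Entry> const& older)
{
    std::vector<Entry> out;
    out.reserve(newer.size() + older.size());
    std::size_t i = 0, j = 0;
    while (i < newer.size() && j < older.size())
    {
        if (newer[i].key < older[j].key)
        {
            out.push_back(newer[i++]);
        }
        else if (older[j].key < newer[i].key)
        {
            out.push_back(older[j++]);
        }
        else
        {
            out.push_back(newer[i++]); // newer entry wins
            ++j;
        }
    }
    out.insert(out.end(), newer.begin() + i, newer.end());
    out.insert(out.end(), older.begin() + j, older.end());
    return out;
}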

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

Copilot AI left a comment

Pull Request Overview

This PR specializes level 0 Bucket merges to operate entirely in-memory, reducing addBatch time under high load. Key changes include:

  • Updating upgrade tests to reflect new merge counter expectations.
  • Adding in-memory merge support in LiveBucket (including new state mEntries and associated APIs) and updating related merging logic.
  • Adjusting tests (BucketManagerTests, BucketIndexTests) and index construction (InMemoryIndex) to account for in-memory entries.

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated no comments.

Summary per file:

  • src/herder/test/UpgradesTests.cpp: Updates upgrade test assertions to reflect revised merge counter values.
  • src/bucket/test/BucketManagerTests.cpp: Adds conditional index file existence checks based on the configured page size.
  • src/bucket/test/BucketIndexTests.cpp: Modifies cache hit/miss expectations to subtract in-memory entry counts.
  • src/bucket/LiveBucket{.h,.cpp}: Introduces in-memory merging support and related APIs (hasInMemoryEntries etc.).
  • src/bucket/InMemoryIndex{.h,.cpp}: Extends in-memory index construction to support the new merge behaviors.
  • src/bucket/BucketOutputIterator{.h,.cpp}: Updates getBucket to optionally use in-memory index state during bucket adoption.
  • src/bucket/BucketListBase{.h,.cpp}: Introduces prepareFirstLevel for level 0 in-memory merges with fallback logic.
  • src/bucket/BucketBase{.h,.cpp}: Refactors mergeInternal and related merging logic to support the new in-memory flows.
  • src/bucket/BucketMergeAdapter.h and HotArchiveBucket{.h,.cpp}: Minor updates to support passing putFunc instead of output iterators.
Comments suppressed due to low confidence (2)

src/bucket/HotArchiveBucket.h:67

  • Typo in comment: 'te' should be 'the'.
    // This function always writes te given entry to the output

src/bucket/test/BucketIndexTests.cpp:312

  • Ensure that subtracting 'sumOfInMemoryEntries' from 'numLoads' accurately accounts for in-memory entries in cache hit/miss metrics, and that tests remain robust across different environments.
                REQUIRE(hitMeter.count() == startingHitCount + numLoads - sumOfInMemoryEntries);

marta-lokhova (Contributor) left a comment

Nice work, I'm a fan of this change! I added a few suggestions for safer code, plus a couple of clarification questions and requests for comments. Could you also share a bit more about the performance gains you've observed while running this? I imagine a fully in-memory addBatch flow would remove most of the addBatch overhead.

@@ -2103,13 +2103,12 @@ TEST_CASE("upgrade to version 11", "[upgrades]")
// Check several subtle characteristics of the post-upgrade
// environment:
// - Old-protocol merges stop happening (there should have
-// been 6 before the upgrade, but we re-use a merge we did
-// at ledger 1 for ledger 2 spill, so the counter is at 5)
+// been 6 before the upgrade)
marta-lokhova (Contributor)

I'm not sure why this test changed. We should still be able to re-use merges, shouldn't we?

SirTyson (Contributor Author)

When we do the quick, in-memory merge, we don't use FutureBucket so we can't "re-use" it as the original comment suggests. This is fine though, as re-using a level 0 merge can only occur when level 1 is empty. This will happen in tests since we start at genesis, but won't happen in prod since we have TX volume.

@@ -234,7 +234,15 @@ TEST_CASE_VERSIONS("bucketmanager ownership", "[bucket][bucketmanager]")
std::string indexFilename =
app->getBucketManager().bucketIndexFilename(b->getHash());
CHECK(fs::exists(filename));
-CHECK(fs::exists(indexFilename));
+
+if (b->getIndexForTesting().getPageSize() == 0)
marta-lokhova (Contributor)

Curious, why did this change?

SirTyson (Contributor Author)

This test was configured to not use any in-memory indexes via the BUCKETLIST_DB_INDEX_CUTOFF config. With this change, we unconditionally store level 0 in memory regardless of the value of that flag. In-memory indexes don't get serialized, so we have to add the check here. This is fine: the memory overhead of level 0 is trivial, so there's no reason to support a configuration with no buckets in memory.
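
For concreteness, a hedged sketch of what the adjusted check can look like, assuming (as described above) that in-memory indexes report a page size of 0 and never produce an index file; the negative check in the first branch is an assumption, the rest follows the diff above:

if (b->getIndexForTesting().getPageSize() == 0)
{
    // In-memory index: never serialized, so no index file on disk.
    CHECK(!fs::exists(indexFilename));
}
else
{
    // Paged on-disk index: the serialized index file must exist.
    CHECK(fs::exists(indexFilename));
}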

std::vector<BucketInputIterator<BucketT>> const& shadowIterators)
{
// Don't count shadow metrics for Hot Archive BucketList
if constexpr (std::is_same_v<BucketT, HotArchiveBucket>)
marta-lokhova (Contributor)

We shouldn't be calling this function on HotArchiveBuckets at all, right? (shadows were dropped in protocol 12). Could we enforce that? I realize this was here before, but since we're already making changes, this is a good opportunity to harden the code.

SirTyson (Contributor Author)

I tried to factor shadows out of HotArchive as much as possible, preferring type dispatch over branching on types where possible. Unfortunately, shadows are embedded pretty heavily in many parts of the Bucket subsystem, but I was able to remove shadows from all HotArchive-specific function headers and removed some of them from BucketBase methods.
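
One way to express the hardening suggested above, sketched under the assumption that the shadow accounting lives in a helper templated on the bucket type (the function name and body here are hypothetical; the static_assert is the point):

template <typename BucketT>
void
countShadowMetrics(
    std::vector<BucketInputIterator<BucketT>> const& shadowIterators)
{
    // Shadows were removed in protocol 12 and never apply to the Hot Archive,
    // so reject that instantiation at compile time rather than silently
    // returning at runtime.
    static_assert(!std::is_same_v<BucketT, HotArchiveBucket>,
                  "shadow metrics only apply to LiveBucket merges");
    // ... existing shadow-metric accounting for LiveBucket ...
}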

@@ -36,10 +36,87 @@ InMemoryBucketState::scan(IterT start, LedgerKey const& searchKey) const
return {IndexReturnT(), mEntries.begin()};
}

InMemoryIndex::InMemoryIndex(BucketManager& bm,
marta-lokhova (Contributor)

Looks like there's a lot of code duplication with the other InMemoryIndex constructor, would it be possible to refactor this? (Maybe MergeAdapter is really just an iterator rather than a container for two bucket inputs?)
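
As a sketch of that suggestion (names hypothetical, not code from the PR), the two constructors could share one loop if the merge input were presented as a single iterator-like interface rather than a pair of bucket inputs:

template <typename EntryT>
struct MergeIterator
{
    virtual ~MergeIterator() = default;
    virtual bool more() const = 0;          // any merged entries left?
    virtual EntryT const& peek() const = 0; // next merged entry
    virtual void advance() = 0;             // consume it
};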

constexpr std::streamoff xdrOverheadBetweenEntries = 4;

// 4 bytes of BucketEntry overhead for METAENTRY
constexpr std::streamoff xdrOverheadForMetaEntry = 4;
marta-lokhova (Contributor)

How much do we win from in-memory indexing at commit time? The manual offset managing worries me a bit, as it's error-prone. We now need to maintain two different ways of managing offsets.

SirTyson (Contributor Author)

Difficult to say, as it's very setup dependent. On our validators, you'll probably see little change since the new file will be in the OS page cache given that stellar-core is the only thing running on the node and we have lots of excess RAM. On captive-core, it's a different story. You don't have much extra RAM for caches given RPC/Horizon/PSQL will dominate RAM usage. You're also less likely to be in the page cache given disk read contention with other processes, and the penalty for a file IO cache miss can be very high if you're on EBS. Personally I think since we have everything in user space memory, it's much better to index based on memory instead of file IO in the blocking path. I'll see if I can clean this up with some constexpr that are more dynamic based on BucketEntry XDR directly.

SirTyson (Contributor Author)

Hmm, it's actually difficult to compute these values in any constexpr way. That being said, I think the hardcoded values are worthwhile: the performance savings from avoiding file IO altogether can be very significant. Also, I think the risk is relatively minimal, as these constexpr values can't ever change. xdrOverheadBetweenEntries is intrinsic to the XDR spec, and xdrOverheadForMetaEntry is the size of the BucketEntryType stored by the BucketEntry, which also can't change. Even if we changed BucketEntry, the cases within the switch statement might change size, but the overhead of the switch itself remains constant. We dynamically measure the size of the actual BucketEntry data via xdr::xdr_size(metadata), so I think these constants are fine.
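
Roughly the shape of the bookkeeping being discussed, for reference (only the two constants come from the patch; the variable and helper names are illustrative, and the exact accounting in the patch may differ):

// Per-entry framing overhead intrinsic to the XDR stream format.
constexpr std::streamoff xdrOverheadBetweenEntries = 4;
// Size of the BucketEntryType discriminant carried by the METAENTRY's
// BucketEntry wrapper; the metadata body itself is measured dynamically.
constexpr std::streamoff xdrOverheadForMetaEntry = 4;

// Walk the in-memory entries and compute each entry's file offset without
// re-reading the bucket file.
std::streamoff offset =
    xdrOverheadForMetaEntry + xdr::xdr_size(metadata); // skip past METAENTRY
for (auto const& be : inMemoryEntries)
{
    offset += xdrOverheadBetweenEntries;
    recordEntryOffset(be, offset); // hypothetical index insertion
    offset += xdr::xdr_size(be);
}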

{
// If we have an in-memory merged result, use that
setCurr(mNextCurrInMemory);
marta-lokhova (Contributor)

could we add a level 0 assert here as well?
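
A minimal sketch of that commit path with the assert added; mNextCurrInMemory is the new member from the diff below, while mLevel, releaseAssert, and the else branch are assumptions about the surrounding code:

if (mNextCurrInMemory)
{
    // In-memory merge results are only produced for level 0, so make the
    // invariant explicit before adopting the result.
    releaseAssert(mLevel == 0);
    // If we have an in-memory merged result, use that
    setCurr(mNextCurrInMemory);
    mNextCurrInMemory.reset();
}
else
{
    // Otherwise resolve the FutureBucket as before.
    setCurr(mNextCurr.resolve());
}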

@@ -363,6 +363,7 @@ template <class BucketT> class BucketLevel
FutureBucket<BucketT> mNextCurr;
std::shared_ptr<BucketT> mCurr;
std::shared_ptr<BucketT> mSnap;
std::shared_ptr<BucketT> mNextCurrInMemory;
marta-lokhova (Contributor)

Could we use std::variant to represent "next curr" in a single variable? (it'd be safer and less error-prone)

SirTyson (Contributor Author)

I tried to implement this with std::monostate too, but unfortunately we use a single long-lasting FutureBucket object per level and reset it manually. Many callers assume this object exists, and it was too big of a refactor to initialize the FutureBucket correctly with respect to monostate.
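
For reference, the suggested single-variable representation would look roughly like this (a sketch of the idea only, not code from the PR):

// Either no pending merge, a background merge in flight, or a completed
// in-memory merge result.
std::variant<std::monostate, FutureBucket<BucketT>, std::shared_ptr<BucketT>>
    mNextCurr;

// Adopting a finished in-memory result then has to name the state explicitly:
if (auto next = std::get_if<std::shared_ptr<BucketT>>(&mNextCurr))
{
    setCurr(*next);
    mNextCurr = std::monostate{};
}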

auto snap = LiveBucket::fresh(
app.getBucketManager(), currLedgerProtocol, inputVectors...,
countMergeEvents, app.getClock().getIOContext(), doFsync,
/*storeInMemory=*/true, /*shouldIndex=*/false);
marta-lokhova (Contributor)

shouldIndex is false here because indexing is redundant: we merge this newly created bucket right away and create a new (persistent) index anyway. Is that correct? I think a comment with some explanation would be nice here.
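
Something along these lines would capture that, assuming the reviewer's reading is right (a sketch of the requested comment; the call itself is copied from the snippet above):

auto snap = LiveBucket::fresh(
    app.getBucketManager(), currLedgerProtocol, inputVectors...,
    countMergeEvents, app.getClock().getIOContext(), doFsync,
    // Keep the entries in memory so the level 0 merge can run without file
    // IO, but skip indexing: this bucket is merged immediately and the merge
    // output gets its own (persistent) index, so an index here would be
    // thrown away.
    /*storeInMemory=*/true, /*shouldIndex=*/false);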

out.put(e);
}

// Store the merged entries in memory in the new bucket in case this is
marta-lokhova (Contributor)

I'm a bit confused about this comment: why do we merge level 1 in-memory?

SirTyson (Contributor Author) · May 30, 2025

Yeah, this comment is wrong and I need to correct it. We need to maintain the in-memory vector in curr, since the next ledger will need to execute an in-memory merge with the output of this function before it gets snapped. What I meant to say in this comment is that when the curr bucket gets snapped, we don't use the in-memory entries for the level 0 snap, but we maintain the vector anyway since it will be GC'd soon. I confused the level 0 snap with level 1. It's a little confusing since, when you write to the BucketList, there's an implicit level -1 snap that merges with level 0 curr.
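
Concretely, the corrected comment could read roughly as follows (a sketch based on the reply above; the loop is as in the snippet and the container name is hypothetical):

for (auto const& e : mergedEntries)
{
    out.put(e);
}

// Keep the merged entries in memory on the resulting level 0 curr bucket:
// the next ledger's addBatch performs another in-memory merge against this
// curr before it is eventually snapped. The entries are not used for the
// snap itself; the vector simply lives until the bucket is GC'd.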

SirTyson changed the base branch from release/v22.3.0 to master on May 30, 2025 at 22:48
SirTyson (Contributor Author)

Should have addressed all comments and rebased on master.

Successfully merging this pull request may close these issues: Reduce File IO for BucketList writes (#4696)